We propose the application of a high-speed maximum likelihood clustering algorithm to detect temporal
financial market states, using correlation matrices estimated from intraday market microstructure features.
We first determine the ex-ante intraday temporal cluster configurations to identify market states,
and then study the identified temporal state features to extract state signature vectors which enable 
online state detection. The state signature vectors serve as low-dimensional state descriptors which can 
be used in learning algorithms for optimal planning in the high-frequency trading domain. We present 
a feasible scheme for real-time intraday state detection from streaming market data feeds. This study 
identifies an interesting hierarchy of system behaviour which motivates the need for time-scale-specific 
state space reduction for participating agents. 

1. Introduction

The financial market represents a prime example of an observable complex adaptive system. Many 
heterogeneous adaptive agents, such as traders, portfolio managers, market makers and regulatory 
authorities, interact non-linearly over time with each other and the electronic exchange, allowing for 
the emergence of complex behaviours beyond that expected based on intrinsic agent characteristics. 
Many authors have viewed financial markets through this lens, considering analogues with physical 
systems to formulate models which aid our understanding of observed system characteristics (see 
Arthur (1995), Arthur et al. (1997), Brock (1993), Hommes (2001), Wilcox and Gebbie (2014) and 
the references therein). Recent technological advances, accelerated by a highly competitive industry, 
have allowed for the efficient generation, storage and retrieval of financial data at micro time scales, 
providing a rich record of the price formation process as a laboratory for intensive study. The 
field of market microstructure developed to study the characteristics and behaviours of financial 
system dynamics at this scale (see O’Hara (1998), Madhavan (2000), Biais et al. (2005), Hinton
(2007), Gençay et al. (2010), Baldovin et al. (2015) for a comprehensive discussion). In particular,
as intraday trading and investment processes become increasingly automated, understanding the 
system dynamics at varying intraday time scales is critical for an efficient trajectory through the 
system to be mapped by participating agents. 

This paper aims to use a physical analogy to the ferromagnetic Potts model at thermal equilibrium
to describe object interactions, before deriving an unsupervised clustering algorithm, where
both the number of clusters and their configuration emerge from the data (Blatt et al. (1996, 1997),
Wiseman et al. (1998), Giada and Marsili (2001)). Treating intraday time periods as objects, the 
algorithm will be used to identify intraday market states from observed market microstructure 
features. Although Marsili (2002) used a similar approach to classify days as states, the authors 
are unaware of another study which applies this technique to intraday period clustering using 
multiple features. In addition, a high-speed Parallel Genetic Algorithm (PGA) will be used for 
efficient computation of the cluster configurations, with absolute computation speeds conducive to 
overnight or even intraday recalibration of identified states (Hendricks et al. (2015)). 

The results reveal an interesting hierarchy of system behaviour at different time scales. Statistically
significant power-law fits to configuration characteristics suggest scale-invariant behaviour
which may translate to persistent features in market states. In addition, the power-law fits yield
different scaling exponents at the different time scales, suggesting that the system may be at criticality
at each scale, possibly with different universality classes characterising behaviour (Dacorogna
et al. (1996), Gabaix et al. (2003), Emmert-Streib and Dehmer (2010), Mastromatteo and Marsili
(2011)). This motivates the importance of time-scale specific information when planning in this
domain. Here we consider the particular case of calendar time when investigating scale-related
phenomena; however, we note that there are alternative scales used to measure the financial system.
There is a rich history in the literature which has aimed to directly model the event time foundations
of market microstructure processes. The seminal work of Garman (1976), which used point
processes to model order book events, forms the basis of many subsequent event time approaches
to modelling transaction and quote data. An important extension of this view is the vector autoregressive
model for trades and quotes developed by Hasbrouck (1988, 1991) and Engle and Russell
(1998). A complementary approach introduces the concept of intrinsic time, which aims to measure 
trading opportunities in reference to specific features of traded stocks, for example, using the rate 
of trading to modify calendar or chronological time. These are discussed by Muller et al. (1995) 
and Derman (2002). The more recent use of Hawkes processes to model mutually-exciting order 
book events is an important return to the idea of viewing events as a foundational concept when 
modelling transactions and order book dynamics (Large (2000), Toke and Pomponio (2012), Bacry 
et al. (2015), Abergel and Jedidi (2015)). 

Easley et al. (2012) introduce the volume time paradigm for high-frequency trading, with the 
clock ticking according to the number of events (proxied by trade volume) flowing through the 
system. This is a pragmatic attempt to reconcile the foundational event-based paradigm introduced 
by Garman (1976) with the wide use of chronological or calendar time. They argue that machines 
operate on a clock which is not chronological, but rather related to the number of cycles per 
instruction initiated by an event (Easley et al. (2012), Patterson and Hennessy (2013)). This 
allows one to measure time in terms of frequency of changes in information, as measured by 
trading volumes. When one considers the complex event processing paradigm which underpins 
many automated trading systems in financial markets (Adi et al. (2006)), one can appreciate the 
suitability of the event-based clock and the view that the calendar time clock is a legacy convenience 
from the low-frequency, human-trader-driven world. As the shift from human-driven to machine-driven
trading dominates financial markets, the study of event-time-scale phenomena has become
increasingly important and warrants further exploration. 

While the identified market states reveal many interesting insights, trading agents would benefit 
from being able to detect online (or in real-time) which state they are currently in. We develop 
a novel technique which extracts the characteristic signature of market activity from each of the 
identified states, and uses this as the basis for an online state detection algorithm. In one application,
this is used to construct 1-step transition probability matrices, which can be refined online
and used in optimal planning algorithms. 

This paper proceeds as follows: Section 2 describes a non-parametric clustering approach using 
a physical Potts model analogy. Section 3 uses the Potts model analogy to derive a maximum 
likelihood estimator for the optimal cluster configuration. Section 4 describes the idea of clustering 
time periods as objects in order to identify market states. Section 5 describes the parallel genetic 
algorithm for high-speed detection of the temporal cluster configuration using the maximum likelihood
estimator. Section 6 describes an approach for online state detection. Section 7 discusses
scale-invariant properties of the cluster configurations and how these may be exploited for efficient 
online state detection. Section 8 discusses the data used, workflow and results. Section 9 provides 
some concluding remarks and suggestions for further research. 


2. Super-paramagnetic clustering 

Blatt et al. (1996, 1997) and Wiseman et al. (1998) proposed a novel non-parametric clustering approach,
based on an analogy to the ferromagnetic Potts model at thermal equilibrium. By assigning
a Potts spin variable to each object and introducing a short-range distance-dependent ferromagnetic
interaction field, regions of aligned spins emerge, which are analogous to groups of objects in
the same cluster, where spin alignment suggests object homogeneity (Wang and Swendsen (1990)). 

More formally, consider a $q$-state Potts model with spins $s_i = 1,\dots,q$ for $i = 1,\dots,N$, where $N$ is
the total number of objects in the system. The cost function is given by the following Hamiltonian:

$$\mathcal{H} = -\sum_{s_i,s_j \in S} J_{ij}\,\delta(s_i,s_j), \qquad (1)$$

where the spins $s_i$ can take on $q$ states and the coupling of the $i$-th and $j$-th objects is governed
by $J_{ij}$. In the case of object clustering for a data sample, a candidate configuration is given by
the set $S = \{s_i\}_{i=1}^{N}$, where $s_i$ represents the cluster group index to which the $i$-th object belongs.

One can consider the coupling parameters $J_{ij}$ as being a function of the correlation coefficient $C_{ij}$
(Kullmann et al. (2000), Giada and Marsili (2001)). This is used to specify an interaction strength that
decreases with the distance between objects. If all the spins are related in this way, then each pair
of spins is connected by some non-vanishing coupling $J_{ij} = J_{ij}(C_{ij})$. This allows one to interpret
$s_i$ as a Potts spin in the Potts model Hamiltonian with $J_{ij}$ decreasing with the distance between
objects (Blatt et al. (1996), Kullmann et al. (2000)). The case where there is only one cluster
can be thought of as a ground state. As the system becomes more excited, it could break up into
additional clusters. Each cluster would have specific Potts magnetisations, even though the net
magnetisation can be zero for the complete system. Generically, the correlation would then be
a function of both time and temperature, in order to encode the evolution of clusters as well as
the hierarchy of clusters as a function of temperature. In the basic approach, one is looking for the
lowest energy state that fits the data.
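
As a minimal illustration of Equation 1, the following sketch evaluates the Potts energy of a candidate configuration. The function name `potts_energy`, the toy coupling matrix and the convention of counting each unordered pair of objects once are assumptions made for illustration; they are not taken from the cited works.

```python
def potts_energy(J, spins):
    """Potts energy H = -sum_{i<j} J[i][j] * delta(s_i, s_j).

    Assumption: each unordered pair (i, j) is counted once.
    """
    n = len(spins)
    energy = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if spins[i] == spins[j]:  # Kronecker delta on the spin labels
                energy -= J[i][j]
    return energy


# Hypothetical symmetric couplings for three objects (illustrative values only).
J = [[0.0, 0.9, 0.1],
     [0.9, 0.0, 0.2],
     [0.1, 0.2, 0.0]]

one_cluster = potts_energy(J, [1, 1, 1])  # all spins aligned
split = potts_energy(J, [1, 1, 2])        # third object in its own cluster
```

With uniformly positive couplings the fully aligned configuration has the lowest energy, which is consistent with the interpretation of the single-cluster case as a ground state.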


3. A maximum likelihood approach 

In order to parameterise the model efficiently, one can choose to make an ansatz for the data 
generative function (Noh (2000)) and use this to develop a maximum-likelihood approach (Giada 
and Marsili (2001)), rather than explicitly solving the Potts Hamiltonian numerically (Blatt et al. 
(1996), Kullmann et al. (2000)). A number of authors have considered this approach for object clustering
(McLachlan et al. (1996), Giada and Marsili (2001), Mungan and Ramasco (2010)), however
we follow the proposition by Giada and Marsili (2001). A summary exposition will be presented 
here (as shown in Hendricks et al. (2015)), with a full derivation available in the Appendices. 
According to the Noh (2000) ansatz, the generative model of the time series associated with the
$i$-th object can then be written as

$$x_i(t) = g_{s_i}\,\eta_{s_i} + \sqrt{1 - g_{s_i}^2}\;\epsilon_i, \qquad (2)$$

where the cluster-related influences are driven by $\eta_{s_i}$ and the object-specific effects by $\epsilon_i$, both
treated as Gaussian random variables with unit variance and zero mean¹. The relative contribution
is controlled by the intra-cluster coupling parameter $g_{s_i}$. The Noh-Giada-Marsili model encodes the
idea that objects which have something in common belong in the same cluster, that object membership
in a particular cluster is mutually exclusive and that intra-cluster correlations are positive.
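
As a hedged numerical check of Equation 2, the sketch below simulates series from the generative model and estimates pairwise correlations. Under the model, two objects in the same cluster $s$ should have correlation close to $g_s^2$, and objects in different clusters should be approximately uncorrelated. The function names, coupling values, sample size and seed are illustrative assumptions.

```python
import math
import random


def simulate_cluster_series(labels, g, n_obs, seed=0):
    """Draw x_i = g_s * eta_s + sqrt(1 - g_s^2) * eps_i for each object i."""
    rng = random.Random(seed)
    clusters = sorted(set(labels))
    series = [[] for _ in labels]
    for _ in range(n_obs):
        # One common Gaussian shock per cluster, shared by its members.
        eta = {s: rng.gauss(0.0, 1.0) for s in clusters}
        for i, s in enumerate(labels):
            eps = rng.gauss(0.0, 1.0)  # object-specific Gaussian shock
            series[i].append(g[s] * eta[s] + math.sqrt(1.0 - g[s] ** 2) * eps)
    return series


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


# Four objects in two clusters, with hypothetical couplings 0.8 and 0.5.
series = simulate_cluster_series([0, 0, 1, 1], {0: 0.8, 1: 0.5}, n_obs=20000)
same_cluster = pearson(series[0], series[1])   # expected near 0.8**2
other_cluster = pearson(series[2], series[3])  # expected near 0.5**2
cross_cluster = pearson(series[0], series[2])  # expected near zero
```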

If one takes Equation 2 as a statistical hypothesis, it is possible to compute the probability
density $P(\{x_i\}|G,S)$ for any given set of parameters $(G,S) = (\{g_s\},\{s_i\})$ by observing the data
set $\{x_i\}$, $i,s = 1,\dots,N$, as a realisation of the common component of Equation 2 as follows (Giada
and Marsili (2001)):

$$P(\{x_i\}|G,S) = \prod_{d=1}^{D} \left\langle \prod_{i=1}^{N} \delta\!\left(x_i(d) - \left(g_{s_i}\eta_{s_i} + \sqrt{1-g_{s_i}^2}\,\epsilon_i\right)\right)\right\rangle. \qquad (5)$$

In Equation 5, $N$ is the number of objects and $D$ is the number of feature measurements for each
object. The variable $\delta$ is the Dirac delta function and $\langle\dots\rangle$ denotes the mathematical expectation.
For a given cluster structure $S$, the likelihood is maximal when the parameter $g_s$ takes the values

$$g_s^* = \begin{cases} \sqrt{\dfrac{c_s - n_s}{n_s^2 - n_s}} & \text{for } n_s > 1, \\ 0 & \text{for } n_s \le 1. \end{cases} \qquad (6)$$


$n_s$ in Equation 6 denotes the number of objects in cluster $s$, i.e.

$$n_s = \sum_{i=1}^{N} \delta_{s_i,s}. \qquad (7)$$

The variable $c_s$ is the internal correlation of the cluster, denoted by the following equation:

$$c_s = \sum_{i=1}^{N}\sum_{j=1}^{N} C_{ij}\,\delta_{s_i,s}\,\delta_{s_j,s}. \qquad (8)$$

The variable $C_{ij}$ is the Pearson correlation coefficient of the data, denoted by the following
equation:

$$C_{ij} = \frac{\overline{x_i x_j}}{\sqrt{\lVert x_i^2\rVert\,\lVert x_j^2\rVert}}. \qquad (9)$$


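The maximum likelihood of structure $S$ can be written as $P(G^*,S|x_i) \propto \exp(D\,\mathcal{L}_c(S))$, where the
resulting likelihood function per feature $\mathcal{L}_c$ is denoted by

$$\mathcal{L}_c(S) = \frac{1}{2}\sum_{s:\,n_s>1}\left[\log\frac{n_s}{c_s} + (n_s - 1)\log\frac{n_s^2 - n_s}{n_s^2 - c_s}\right]. \qquad (10)$$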


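¹ This form of the price model ensures that the self-correlation of a stock is one and independent of the cluster coupling. This
can be seen by computing the self-correlation $E[x_i^2]$ and using the fact that the cluster and stock-specific processes are
unit-variance, zero-mean processes:

$$E\left[\left(g_{s_i}\eta_{s_i} + \sqrt{1-g_{s_i}^2}\,\epsilon_i\right)^2\right] = g_{s_i}^2 + (1 - g_{s_i}^2) = 1. \qquad (3)$$

This is not a unique choice; another possible choice often used is

$$E\left[\left(\frac{\sqrt{g_{s_i}}}{\sqrt{1+g_{s_i}}}\,\eta_{s_i} + \frac{1}{\sqrt{1+g_{s_i}}}\,\epsilon_i\right)^2\right] = \frac{1+g_{s_i}}{1+g_{s_i}} = 1. \qquad (4)$$
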
From Equation 10, it follows that $\mathcal{L}_c = 0$ for clusters of objects that are uncorrelated, i.e. where
$g_s^* = 0$ or $c_s = n_s$, or when the objects are grouped in singleton clusters for all the cluster indexes
($n_s = 1$). Equations 8 and 10 illustrate that the resulting maximum likelihood function for $S$
depends on the Pearson correlation coefficient $C_{ij}$ and hence exhibits the following advantages in
comparison to conventional clustering methods:

• It is unsupervised: The optimal number of clusters is unknown a priori and not fixed at the 
outset 

• The interpretation of results is transparent in terms of the model, namely Equation 2. 

Giada and Marsili (2001) state that $\max_S \mathcal{L}_c(S)$ provides a measure of structure inherent in the cluster
configuration represented by the set $S = \{s_1,\dots,s_N\}$. The higher the
value, the more pronounced the structure.
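
The likelihood in Equation 10 depends only on the correlation matrix and the cluster labels, so it can be evaluated directly for any candidate configuration. The sketch below is one possible implementation under stated assumptions: clusters with $c_s \le n_s$ are taken to contribute zero (consistent with $g_s^* = 0$), and $c_s$ is capped marginally below $n_s^2$ to avoid a division by zero for perfectly correlated clusters. The function name and the toy correlation matrix are illustrative.

```python
import math


def cluster_likelihood(C, labels):
    """Per-feature log-likelihood of Equation 10 for one configuration."""
    total = 0.0
    for s in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == s]
        n_s = len(members)
        if n_s <= 1:
            continue  # singleton clusters do not contribute
        # Internal correlation c_s of the cluster (Equation 8).
        c_s = sum(C[i][j] for i in members for j in members)
        if c_s <= n_s:
            continue  # assumption: non-positive net coupling contributes zero
        c_s = min(c_s, n_s ** 2 - 1e-9)  # assumption: guard the log singularity
        total += math.log(n_s / c_s)
        total += (n_s - 1) * math.log((n_s ** 2 - n_s) / (n_s ** 2 - c_s))
    return 0.5 * total


# Toy correlation matrix with two correlated pairs (illustrative values only).
C = [[1.0, 0.8, 0.0, 0.0],
     [0.8, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.8],
     [0.0, 0.0, 0.8, 1.0]]

paired = cluster_likelihood(C, [0, 0, 1, 1])      # matches the block structure
merged = cluster_likelihood(C, [0, 0, 0, 0])      # one large cluster
singletons = cluster_likelihood(C, [0, 1, 2, 3])  # every object on its own
```

In this toy case the configuration matching the block structure attains the highest value, the merged configuration a lower positive value, and the singleton configuration zero.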

We note that the particular choice of Gaussian innovations in Equation 2 is convenient, since the 
Pearson correlation coefficient then completely characterises pairwise interactions amongst objects 
in the system (Giada and Marsili (2001)). This is a necessary condition, given the physical analogy 
and link to the motivating Hamiltonian given in Equation 1. The application of this technique to 
high-frequency financial time series may motivate a more prudent assumption for the underlying 
object and cluster dynamics, incorporating jumps to better model the price formation process 
at this scale. However, the use of, say, jump diffusion innovations would require an alternative 
dependency metric, such as Lévy copulas, to completely capture object interactions (Cont and
Tankov (2004), McNeil et al. (2015)), requiring a careful re-derivation of the appropriate likelihood
function. This will be explored in further research. 


4. Detecting temporal states using clustering 

The data generative model specified by Equation 2 is sufficiently generic that it can be applied 
to a diverse set of problem domains, where object and cluster innovations can be assumed to be 
Gaussian. In the financial domain, initial applications focused on clustering stocks based on price 
changes (Giada and Marsili (2001), Hendricks et al. (2015)), however Marsili (2002) proposed that 
this technique could be used to cluster time periods in order to identify temporal market states. 
Days were grouped into clusters based on the closing price performance of the chosen universe of 
stocks, demonstrating a meaningful classification of market-wide activity which persists through 
time (Marsili (2002)). We propose that a similar approach can be applied to discover intraday 
temporal states, clustering time periods based on the performance of multiple observable market 
microstructure features. A practical trading system often has access to a real-time market data 
feed, from which multiple features can be extracted to describe various aspects of the evolving 
limit order book. In addition, examining temporal cluster configurations at varying time scales 
can suggest a hierarchy of system behaviour, providing insights into exogenous and endogenous 
market activity. This can also assist trading agents in developing optimal trajectories for varying 
objectives, such as stock acquisition or liquidation at minimal cost. In particular, for an agent 
tasked to learn an optimal policy (state-action mapping), the grouping of temporal periods into 
market states based on market microstructure feature performance provides a novel scheme to 
reduce the dimensionality of the state space and promote efficient learning. 

In this paper, we will focus on the emergent hierarchy of system behaviour at different time scales 
and explore a scheme for online state detection. In one application, this leads to a system of 1-step 
state transition probability matrices at varying scales, which can be refined online in real-time. 
These can be used in optimal planning schemes where Markovian dynamics are assumed and state 
persistence can be exploited. 
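
To make the construction of a 1-step transition probability matrix concrete, the sketch below counts transitions in a sequence of detected states and normalises each row. The function name and the example sequence are hypothetical. Returning the raw counts alongside the probabilities is a design choice which allows the matrix to be refined as new state assignments arrive.

```python
def transition_matrix(state_sequence, n_states):
    """Estimate 1-step transition counts and probabilities from a state path."""
    counts = [[0] * n_states for _ in range(n_states)]
    for current, following in zip(state_sequence, state_sequence[1:]):
        counts[current][following] += 1
    probs = []
    for row in counts:
        total = sum(row)
        # Rows for states never visited are left as zeros.
        probs.append([c / total if total else 0.0 for c in row])
    return counts, probs


# Hypothetical sequence of detected states for consecutive periods.
sequence = [0, 0, 1, 0, 0, 1, 1, 0]
counts, probs = transition_matrix(sequence, n_states=2)
```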


5. A high-speed Parallel Genetic Algorithm implementation

The likelihood function specified in Equation 10 serves as the objective function in a metaheuristic
optimisation routine, where candidate cluster configurations are evaluated and successively improved
until a configuration is found which best explains the inherent structure in a given correlation matrix.
Giada and Marsili (2001) used simulated annealing and deterministic maximisation to approximate
the maximum likelihood structure. While appropriate for their study, these techniques are
inherently computationally intensive and may require a significant amount of time to converge for
large-scale problems. In addition, it is unclear whether such trajectory-based methods are appropriate
for the multi-featured clustering problem considered in this paper, since Giada and Marsili
(2001) clustered objects (stocks) based on a single feature (price returns). Hendricks et al. (2015)
and Cieslakiewicz (2014) propose the use of a high-speed Parallel Genetic Algorithm (PGA), leveraging
the Streaming Multiprocessors (SMs) of a Graphics Processing Unit (GPU), where Equation
10 is used as a fitness function to find the cluster configuration which best approximates the
maximum likelihood structure. They implemented a C-based master-slave PGA using the Nvidia
Compute Unified Device Architecture (CUDA) development environment, using the Single Program
Multiple Data (SPMD) architecture to enumerate the GPU thread hierarchy with population
members for concurrent application of genetic operators.

Consider the problem of finding the cluster configuration of $n$ objects. Then, given $N$ candidate
cluster configuration structures making up the population,

$$\begin{aligned}
S_1 &= \{s_1^1, \dots, s_n^1\} \\
S_2 &= \{s_1^2, \dots, s_n^2\} \\
&\;\;\vdots \\
S_N &= \{s_1^N, \dots, s_n^N\}
\end{aligned}$$

would be mapped to the GPU thread hierarchy using a 2-dimensional grid, as shown in Table 1.


                    CUDA thread block grid

              S_1       S_2       ...    S_N
object_1      s_1^1     s_1^2     ...    s_1^N
object_2      s_2^1     s_2^2     ...    s_2^N
...           ...       ...       ...    ...
object_n      s_n^1     s_n^2     ...    s_n^N


Table 1.: Mapping of population to CUDA thread hierarchy


The PGA was applied to the relatively small problem of finding the cluster configuration of
18 objects, but nevertheless demonstrated fast absolute computation time compared to state-of-the-art
methods, with the promise of scalability within the constraints of the GPU architecture used
(Hendricks et al. (2015), Cieslakiewicz (2014)). We have restricted our analysis to intraday temporal
periods within one month; however, this still yields up to 2208 objects in the 5-minute case. Table
2 shows the specifications and capabilities of the two candidate GPUs and Table 3 shows the PGA 
parameter values and number of objects for each of the time scales investigated. The mapping of 
candidate configurations to the GPU thread hierarchy under the SPMD paradigm results in an 
upper bound on the permissible number of objects and population size. Hendricks et al. (2015) 
further recognised the importance of ensuring that the population size is large enough relative 
to the number of objects, to ensure sufficient population diversity for convergence to the best 
approximation of the maximum likelihood structure within a finite number of generations. Smaller 
populations often lead to sub-optimal algorithm terminations and inconsistent results. For the 60-minute,
30-minute and 15-minute cases, the Nvidia Geforce GTX765m notebook GPU had sufficient
capability to determine the optimal cluster configurations from sufficiently large populations. The
5-minute case demanded a larger capacity GPU, and the Nvidia Geforce GTX Titan X provided
the necessary additional SMs, CUDA cores and global memory to facilitate efficient computation.


                                        Graphics Processing Unit (GPU)

Feature                                 Nvidia Geforce GTX 765m    Nvidia Geforce GTX Titan X
Compute capability                      3.5                        5.2
CUDA cores                              768                        3072
Memory                                  2048MB                     12228MB
Number of streaming multiprocessors     16                         96
Max threads / thread block              1024                       1024
Thread block dimension                  32                         32
Max thread blocks / multiprocessor      16                         32


Table 2.: Graphics Processing Unit specification and capabilities


Time scale   Number of periods   Population   Generations   Stall         Mutation      Crossover     Computation
             (objects)           size                       generations   probability   probability   time (sec)*
5-minute     2208                4000         4000          1000          0.09          0.9           603 (D)
15-minute    736                 1000         4000          500           0.09          0.9           382 (N)
30-minute    368                 800          4000          500           0.09          0.9           215 (N)
60-minute    184                 600          4000          500           0.09          0.9           132 (N)


Table 3.: Parameter values and computation times for Parallel Genetic Algorithm

* Average from 20 independent runs; N refers to the GTX765m Notebook GPU and D refers to the GTX Titan X Desktop GPU.


We note that the number of generations and stall generations indicated in Table 3 are higher than 
one would typically specify for a genetic algorithm, since these promote potential over-fitting to the 
prescribed dataset. Recall that our application is to find the candidate cluster configuration which 
best explains the structure inherent in a given correlation matrix. Thus we are not concerned with 
out-of-sample validity, but would rather prefer to find a configuration with the highest likelihood 
value. The higher number of generations and stall generations, together with the mutation operator, 
promotes convergence to a higher likelihood structure. The average computation times indicated 
in Table 3 are not overly onerous, suggesting that for practical application, overnight or even 
intraday estimation of cluster configurations to capture recent dynamics is feasible. The proposed 
PGA thus offers an efficient, scalable alternative for finding the best approximation of the optimal 
cluster configuration, suitable for clustering objects on multiple observable features. 
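
The role of Equation 10 as a fitness function can be illustrated with a small serial genetic algorithm. The sketch below is not the CUDA master-slave PGA of Hendricks et al. (2015); it is a CPU-only simplification in which the selection scheme (keep the fitter half, carry the best member forward unchanged), the one-point crossover, the default parameter values and all names are assumptions made for illustration.

```python
import math
import random


def likelihood(C, labels):
    """Equation 10; clusters with c_s <= n_s are assumed to contribute zero."""
    groups = {}
    for i, s in enumerate(labels):
        groups.setdefault(s, []).append(i)
    total = 0.0
    for members in groups.values():
        n = len(members)
        c = sum(C[i][j] for i in members for j in members)
        if n > 1 and c > n:
            c = min(c, n * n - 1e-9)  # guard against perfectly correlated clusters
            total += math.log(n / c) + (n - 1) * math.log((n * n - n) / (n * n - c))
    return 0.5 * total


def evolve(C, pop_size=40, generations=60, p_mut=0.09, p_cross=0.9, seed=1):
    """Serial GA over cluster-label vectors; returns (best, fitness history)."""
    rng = random.Random(seed)
    n = len(C)
    pop = [[rng.randrange(n) for _ in range(n)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(pop, key=lambda s: likelihood(C, s), reverse=True)
        history.append(likelihood(C, ranked[0]))
        elite = ranked[: pop_size // 2]
        children = [ranked[0][:]]  # elitism: best member survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            child = a[:]
            if rng.random() < p_cross:
                cut = rng.randrange(1, n)  # one-point crossover
                child = a[:cut] + b[cut:]
            # Mutation: reassign each object to a random cluster index.
            child = [rng.randrange(n) if rng.random() < p_mut else s for s in child]
            children.append(child)
        pop = children
    best = max(pop, key=lambda s: likelihood(C, s))
    history.append(likelihood(C, best))
    return best, history


# Toy correlation matrix: two blocks of three objects (illustrative values only).
C = [[1.0 if i == j else (0.7 if i // 3 == j // 3 else 0.0) for j in range(6)]
     for i in range(6)]
best, history = evolve(C)
```

Because the best member is always carried forward, the recorded fitness cannot decrease between generations. In the GPU setting described above, the fitness evaluations and genetic operators for different population members would instead be executed concurrently.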


6. State Signature Vectors for online state detection 

The clustering procedure described thus far can be used as an unsupervised algorithm to group 
temporal periods into states according to feature similarity; however, this can only reveal the ex-ante
temporal states and is not suitable for online detection. Upon examination of the resulting
cluster configurations, we noted that each node refers to a particular time period, with an associated 
signature of market activity. Furthermore, if two time periods appear in the same cluster, given the 
data generative model assumed in Equation 2, we conjecture that it is the relative similarity of their 
characteristic signatures of market activity which resulted in their assignment to the same cluster. 
Using this idea, given a cluster configuration of temporal periods into market states, it is possible 
to extract a state signature vector (SSV) which summarises the signature of market activity across 
stocks and time periods for each state. Then, if one is faced with a new candidate feature vector 
(FV), the market state assignment can be determined by using the closest match within the set of
pre-determined SSVs computed offline. FVs are easy to compute online from a streaming datafeed 
and state assignment can be achieved using a simple Euclidean distance computation. To make 
these ideas concrete, consider the example illustrated in Figure 1. 



[Figure 1 (image not reproduced): two detected states (STATE 1 and STATE 2), each with a state signature vector over the features Price, Spread, Vol and Volimb, and a new feature vector which is assigned to STATE 1. Workflow shown: detect temporal clusters / states; compute state signature vectors for each state; a new feature vector arrives; calculate distance between new feature vector and existing state signature vectors; assign to state based on closest match.]


Figure 1.: Illustration of online state assignment based on identified state signature vectors.


Here, we compute two SSVs from the identified states, and use these as a basis for assigning a
new FV to a market state. This is based on a simple Euclidean distance metric,

$$\arg\min_p \lVert FV - SSV_p \rVert,$$

where $p$ is the index of the identified states.
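
The assignment rule can be sketched in a few lines. The SSV values and the new feature vector below are hypothetical numbers chosen for illustration, with elements ordered as price, spread, volume and volume imbalance returns; the function name `assign_state` is likewise an assumption.

```python
import math


def assign_state(fv, ssvs):
    """Return the state whose SSV is closest to fv in Euclidean distance."""
    distances = {p: math.dist(fv, ssv) for p, ssv in ssvs.items()}
    return min(distances, key=distances.get), distances


# Hypothetical state signature vectors computed offline.
ssvs = {
    1: [0.5, 0.2, 0.4, -0.1],
    2: [-0.3, 0.6, -0.2, 0.3],
}
new_fv = [0.4, 0.1, 0.5, 0.0]  # hypothetical feature vector from the live feed
state, distances = assign_state(new_fv, ssvs)
```

Only one distance computation per stored SSV is required, which is what makes the scheme suitable for streaming data.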

In this paper, we have used four features to characterise market activity at intraday scale. These
include: trade price, trade volume, spread and quote volume imbalance. In particular, we consider
the relative change in each of these features. For example, based on a set of feature measurements
$\mathcal{F}^{5min}$ at the 5-minute scale, we would compute

$$\Delta f_t^{5min} = \frac{f_t^{5min} - f_{t-1}^{5min}}{f_{t-1}^{5min}}$$

for all $f \in \mathcal{F}^{5min}$. For the initial temporal cluster detection stage, these “feature returns” are
calculated for each stock and concatenated before computing the time period correlation matrix.
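
A minimal sketch of the feature return calculation and the concatenation step follows. It assumes that each feature series is already aggregated to the chosen time scale and that previous values are non-zero; the function names and input values are illustrative.

```python
def feature_returns(values):
    """Relative change (f_t - f_{t-1}) / f_{t-1} for consecutive observations.

    Assumption: no previous value is zero.
    """
    return [(curr - prev) / prev for prev, curr in zip(values, values[1:])]


def period_vectors(return_series):
    """Concatenate equal-length return series into one vector per time period."""
    return [list(column) for column in zip(*return_series)]


# Hypothetical aggregated observations for one stock over three periods.
prices = [100.0, 101.0, 99.99]
spreads = [2.0, 3.0, 2.4]

price_ret = feature_returns(prices)
spread_ret = feature_returns(spreads)
periods = period_vectors([price_ret, spread_ret])  # one row per time period
```

Each row of `periods` then describes a single time period, and further stocks and features would extend each row.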

For the extraction of SSVs from significant states, we compute average feature returns across 
member periods and stocks. For example, consider the case of 15-minute period clustering. If one 
state (cluster) consisted of 2 periods (09:15 - 09:30 and 15:15 - 15:30), then we would find the 
average trade price, trade volume, spread and quote volume imbalance returns across stocks in each 
period (i.e. two 4-element vectors), then average across these two vectors to get a single 4-element 
vector, which would be the representative SSV for that state. 
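
The two-stage averaging described above can be sketched as follows, assuming the feature returns for each member period are arranged as one 4-element vector per stock. The function names and the numerical values are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def state_signature(period_stock_returns):
    """Average across stocks within each period, then across member periods."""
    per_period = [mean_vector(stocks) for stocks in period_stock_returns]
    return mean_vector(per_period)


# Hypothetical state with two member periods and two stocks; each vector holds
# [price, volume, spread, volume imbalance] returns.
state_periods = [
    [[0.02, 0.10, -0.04, 0.2], [0.00, 0.30, 0.00, 0.0]],   # first period
    [[0.04, 0.20, 0.02, -0.1], [0.02, 0.00, 0.02, -0.3]],  # second period
]
ssv = state_signature(state_periods)
```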

Although this results in a loss of information, we conjecture that the average signature of feature 
returns broadly captures the state of market activity. The SSVs for each time-scale configuration 
are illustrated in Figures 4, 6, 8 and 10. Following this approach, the FVs calculated in the online
environment would constitute the same averages of feature returns, before matching to the
appropriate SSV. We note that this is merely one candidate scheme for extracting SSVs which are 
conducive to online matching for state assignment, however alternative schemes for extraction of 
SSVs which preserve state-specific information will be explored in future work. The chosen features 
do not represent an exhaustive set of possible explanatory factors for intraday market activity, but
rather were chosen based on the relative ease of their online construction from streaming Level-1
market data feeds (JSE (2015)). Additional features can be considered in future work.


7. Scale-invariant characteristics of states 

The detected temporal cluster configurations can be further analysed to determine whether any 
characteristics exhibit scale-invariant behaviour. In particular, a visual inspection of the cluster 
configurations shown in Section 8.4 led us to conjecture a possible power-law fit for cluster sizes. 
Many physical and man-made systems exhibit characteristics which follow a power-law functional 
form, and its unique mathematical properties sometimes lead to surprising physical insights (Gabaix 
et al. (2003), Clauset et al. (2009)). Many authors have investigated the nature of information and 
forecasting at different time scales in financial markets (see Dacorogna et al. (1996), Zhang et al. 
(2005), Emmert-Streib and Dehmer (2010) as examples). For our application, the existence of
different critical exponents for the best power-law fits at different time scales may suggest different 
universality classes which characterise the system activity at each scale. In fact, Mastromatteo and 
Marsili (2011) discuss the notion that, for a complex adaptive system, distinguishable models can 
only be gleaned when the system is near criticality. Thus, if financial markets truly are a complex 
adaptive system, measurable quantities from the dynamics at each scale should yield a statistically 
significant power-law fit. Although it is difficult to quantify the exact nature of these scale-specific 
behaviours or universality classes, their apparent existence suggests that investment and trading 
decisions would benefit from time-scale-specific state space information. This would enhance the 
efficacy of intraday policies which aim to find optimal trajectories through the system. 

Given the difficulties of identifying statistically significant power-law fits to empirical quantities
(Bauke (2007)), we incorporated the maximum likelihood fitting procedure provided by Clauset et al.
(2009). Outputs from their functions include the scaling parameter of the proposed power-law fit,
a Kolmogorov-Smirnov test for the goodness-of-fit of the proposed model to the data, the lower
bound for the fit if the tail distribution follows a power-law and the log-likelihood of the data under
the power-law fit. 
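
To indicate what such a fit involves, the sketch below implements the standard continuous maximum likelihood estimator of the scaling parameter, $\hat{\alpha} = 1 + n\,[\sum_i \ln(x_i / x_{min})]^{-1}$, together with a Kolmogorov-Smirnov distance and the log-likelihood. It is not the code of Clauset et al. (2009): here the lower bound `x_min` is supplied rather than estimated, the continuous form is used as an approximation even though cluster sizes are integers, and no p-value is computed. All names and the synthetic sample are assumptions.

```python
import math
import random


def fit_power_law(data, x_min):
    """Continuous power-law MLE for the tail x >= x_min.

    Returns (alpha, ks_distance, log_likelihood).
    """
    tail = sorted(x for x in data if x >= x_min)
    n = len(tail)
    log_sum = sum(math.log(x / x_min) for x in tail)
    alpha = 1.0 + n / log_sum
    ks = 0.0
    for k, x in enumerate(tail):
        model_cdf = 1.0 - (x / x_min) ** (1.0 - alpha)
        # Compare with the empirical CDF just before and just after x.
        ks = max(ks, abs(model_cdf - k / n), abs(model_cdf - (k + 1) / n))
    log_lik = n * math.log((alpha - 1.0) / x_min) - alpha * log_sum
    return alpha, ks, log_lik


# Synthetic sample drawn by inverse transform from a power law with alpha = 2.5.
rng = random.Random(42)
true_alpha, x_min = 2.5, 1.0
sample = [x_min * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0))
          for _ in range(5000)]
alpha_hat, ks_distance, log_lik = fit_power_law(sample, x_min)
```

On the synthetic sample the estimate should lie close to the true exponent and the Kolmogorov-Smirnov distance should be small; for empirical cluster sizes, the full procedure with an estimated lower bound and a goodness-of-fit test remains the appropriate tool.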

We note that a detected temporal cluster configuration results in a set of homogeneous market 
states, although it is not clear which states are significant, i.e. likely to persist, or merely transient. 
Using all identified states may result in spurious state assignments if one uses the online algorithm 
described in Section 6. This leads to the need for some selection criteria for significant states, before 
extracting SSVs. Candidate criteria include using intra-cluster connectedness ($c_s$) or cluster size
with some form of thresholding procedure, however these heuristics are inherently subjective. The 
power-law fit to cluster size provides one candidate objective approach for state selection. Under 
the assumption that the system is near criticality when we find a stable parameter calibration, 
choosing the states which best fit the power-law functional form may aid in isolating those states 
which best capture the system behaviour at that scale, i.e. filter the stable, persistent states from 
the noise. This provides an objective mechanism for selecting significant states, reducing the set of 
SSVs which form the basis for the online state detection algorithm. 


8. Data and results

8.1. Data description

The data for this study constituted tick-level trades and top-of-book quotes for 42 stocks on the 
Johannesburg Stock Exchange (JSE) from 1 November 2012 to 30 November 2012. This data was 
sourced from the Thomson Reuters Tick History (TRTH) database. The raw data was aggregated 
according to the time-scale considered (5-minute, 15-minute, 30-minute and 60-minute), before 
calculating the required features (change in trade price, trade volume, spread and volume imbalance). 
The 42 stocks considered represent the prevailing constituents of the FTSE/JSE Top40 headline 
index, which contains the 42 largest stocks by market capitalisation in the main board’s FTSE/JSE 
All-Share index. 

The objects of interest for the cluster analysis are the time periods. Table 4 provides an example 
of the required data returns matrix, from which a correlation matrix is computed for time period 
similarity. This is the only required input for the clustering algorithm. 
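
As a minimal sketch of this input step, the period-by-period correlation matrix can be computed directly from a returns matrix. The layout is an assumption, since the body of Table 4 is not reproduced here: each row is taken to be one time period (the object being clustered) and each column one observation of that period, for example a stock-feature return. The function name `period_correlation` is hypothetical.

```python
import numpy as np


def period_correlation(returns):
    """Correlation matrix between time periods.

    `returns` is assumed to have shape (n_periods, n_observations): one row per
    time period, one column per stock-feature return observed in that period.
    """
    r = np.asarray(returns, dtype=float)
    # np.corrcoef treats each row as a variable, so rows (periods) are correlated
    # with each other across the columns.
    return np.corrcoef(r)
```

A period whose row is constant has zero variance and yields undefined correlations, so such rows would need to be filtered out before clustering.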



Table 4.: Illustration of data returns matrix as an input for estimation of 15-minute period correlations 


8.2. Workflow 

Figure 2 illustrates the process workflow and tools used for performing the temporal cluster analysis. 
The TRTH tick data is stored in a MongoDB noSQL database, with optimised query indexes 
for efficient data retrieval. A bespoke Application Programming Interface (API) was written to 
transport data from MongoDB to our primary scientific computing platform, MATLAB. The data 
is used to instantiate a High Frequency Time Series (HFTS) object in MATLAB, which allows for 
efficient merging, resampling and aggregation of large-scale irregularly-spaced tick data. Based on 
a chosen time-scale, the data is aggregated, features are extracted and returns calculated, before 
computing the time period correlation matrix. The PGA was implemented in CUDA-C using 
Nvidia Nsight and the Microsoft Visual Studio development environment. The compiled PGA was 
called from the MATLAB environment to run the temporal cluster analysis. The resulting cluster 
configuration is transported to the MATLAB workspace, from which we can determine the power- 
law fits, extract SSVs, estimate online clusters and compute transition probability matrices. Using 
the stock, time period, cluster configuration and correlation data, a MATLAB script was written to 
generate an XML file containing the required node and edge metadata for an undirected graph to 
import into Gephi. Gephi was used for cluster configuration visualisation, as described in Section 8.3. 


Figure 2.: Flowchart illustrating workflow to determine the temporal cluster configuration from a time period correlation matrix, 
identify persistent states, estimate temporal cluster configuration using feature vectors and determine state transition probabilities. 
Processes are coloured by platform: MongoDB = Yellow, MATLAB = Green, CUDA-C = Orange, Gephi = Purple. 


8.3. Visualisation 

For the cluster configuration visualisation, we made use of the Gephi graph visualisation and 
manipulation software package (Bastian et al. (2009)), with a customised enumeration of nodes 
and edges and the Fruchterman and Reingold (1991) node spacing algorithm. The presence of 
an edge between nodes indicates membership to the same cluster, while edge thickness provides 
a visual impression of object-object correlation, and hence intra-cluster connectedness. For the 
visualisations which follow, we chose to colour the nodes by intraday time period, in order to 
illuminate any calendar time effects in the detected states. According to this scheme, the same 
time on different days will receive the same colour. These visualisations are shown in Figures 3, 5, 
7, 9, 12, 13, 14 and 15. 


8.4. Results discussion 

For each set of results, we consider 8 hours of continuous trading activity each day, from 09:00 
to 17:00, for the duration of one month. Figure 3 shows the temporal cluster configuration of 60- 
minute periods. We first note that the detection of non-trivial clusters from microstructure-based 
time correlations indicates that intraday dynamics may be reducible to a finite set of temporal 
states. Considering the time-of-day colour shading, we notice two clusters which exhibit market 
activity characteristics which coincide with morning and afternoon times. The dark green cluster 
refers to the first hour of the trading day (09:00 to 10:00), which incorporates opening auction and 
subsequent activity. We note that the South African equity market is strongly influenced by global 
market activity, in part due to local stocks being listed on multiple exchanges in the UK, USA, 
Europe and Australia (JSE (2014)). During the period considered in this analysis, the UK market 
open occurred at 10:00 SAST and US market open at 15:30 SAST. The UK market open has a 
significant impact on local trading dynamics, with the 10:00 to 11:00 periods dispersing across 
clusters with no discernible time-of-day correlation. We note a contiguous dark orange cluster 
emerging from 15:00 to 16:00, as the US market starts to participate in local trading activity. This 
pattern of market activity broadly corroborates these exogenous market effects from global markets. 
Figure 4 shows the SSVs extracted from the significant states selected from Figure 3. As discussed 
in Section 7, we used the $x_{\min}$ statistic from the power-law fit to the tail distribution of cluster 
sizes to determine the significant states. For the 60-minute periods, the most significant power-law 
fit was for cluster sizes > 13, resulting in 6 significant states. The resulting SSVs are all relatively 
different, when considering the magnitude and direction of each of the average change in feature 
values. This ensures greater certainty in the state assignment of an online FV. 
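
A minimal sketch of the SSV extraction step is given below. It assumes each period has already been reduced to a feature vector (FV) of average changes across stocks (trade price, spread, trade volume, quote volume imbalance), and that a cluster index per period is available from the clustering run. The function name `state_signature_vectors` and the argument names are hypothetical; the strict inequality follows the "size > $x_{\min}$" wording used in the text.

```python
import numpy as np


def state_signature_vectors(features, labels, x_min):
    """Average the feature vectors of the periods in each significant cluster.

    features: array of shape (n_periods, n_features), one FV per period.
    labels:   cluster index of each period, length n_periods.
    x_min:    size threshold from the power-law fit; clusters with size > x_min
              are treated as significant states (assumption taken from the text).
    Returns a dict mapping cluster index -> state signature vector.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    ssvs = {}
    for s in np.unique(labels):
        members = labels == s
        if members.sum() > x_min:
            # The signature is the mean FV across all member periods.
            ssvs[int(s)] = features[members].mean(axis=0)
    return ssvs
```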

Figure 5 shows the temporal cluster configuration of 30-minute periods. We see a larger number 
of states emerge as the granularity increases, with 60-minute states being dissected based on finer- 
grained market activity. The dark green and dark orange contiguous morning and afternoon states 
still persist at this scale, although endogenous system characteristics begin to mask previously 
identified exogenous characteristics. We note that there is no defined hierarchy emerging, in that a 
set of 30-minute clusters cannot be combined to form the 60-minute clusters identified previously, 
further highlighting time-scale-specific behaviour. Figure 6 shows the SSVs of significant states, 
based on the 10 clusters with a size > 14. 

Figure 7 shows the temporal cluster configuration of 15-minute periods. We notice increasing 
time-of-day diversity in each of the identified clusters, further highlighting endogenous system 
activity. The red contiguous cluster is associated with the period from 16:30 to 16:45, suggesting 
a particular signature of market activity leading into the closing auction, which starts at 16:50. 
The UK and US related effects seem to have a weaker impact at this scale, with exchange-specific 
rules having a more dominant effect. As a result, we see a larger variety of SSVs in Figure 8, some 
with similar profiles seen at the 30-minute scale, but with a larger focus on magnitude, rather than 
merely direction. 

Figure 9 shows the temporal cluster configuration of 5-minute periods. Here we see quite a different 
profile of system behaviour. There are a large number of singletons, which could be attributed 
to the amount of noise in the data at this scale, making it more difficult to discern significant 
structure. We notice an interesting time-of-day correlation with detected clusters; however, broad 
periods (morning, lunch, afternoon) appear to have been dissected into contiguous blocks based 
on state-specific market activity. The 5-minute time scale is starting to capture the effects of 
automated, rule-based trading agents, which show quite a different characteristic signature. This 
further highlights the importance of studying market activity profiles at the scale at which one 
intends to participate. Even when one considers the associated SSVs in Figures 10 and 8, the 5-minute 
and 15-minute studies exhibit the same number of significant states using the power-law criterion; 
however, the combinations of direction and magnitude for the feature values are quite different. 

Figure 11 illustrates the results of the power-law fits to the cluster size empirical distribution 
at each time scale. Each fit to the tail distribution exhibits a Kolmogorov-Smirnov p-value > 0.1 
(assuming a null hypothesis of a power-law fit), so the power-law functional form cannot be 
rejected for the given scaling factor ($\alpha$) and minimum size ($x_{\min}$) (Clauset et al. (2009)). In addition, 
we note the $\alpha$ exponents are different for each of the time scales considered. This evidence of 
statistically significant power-law fits at each measured scale is consistent with the notion of financial 
markets as a complex adaptive system, and that the system is near criticality at each measured 
time scale (Mastromatteo and Marsili (2011)). A further study should verify whether this suggests 
different universality classes of system behaviour at different time scales; however, these preliminary 
results do indicate the presence of some complex hierarchy of system behaviour, motivating the 
need for scale-specific temporal analysis. 

Figure 12 shows the estimated 60-minute cluster configuration for the same period (1 November 
2012 to 30 November 2012), but where the distance of each period’s FV to the identified SSVs is 
used as the criterion for state assignment. This is a simple in-sample test to determine whether 
the proposed scheme for online state assignment can discern the structure suggested by direct 
application of the clustering algorithm. By comparing Figure 12 and Figure 3, we notice that 
the online state assignment algorithm does recover the contiguous morning and afternoon states, 
but more broadly it intuitively separates periods into: an opening auction and early morning trading 
state, a UK market open state, two lunch states, a US market open state and an end-of-day/closing 
auction state. This completely captures the exogenous market effects, which is a strong validation 
for the approach. Table 5 shows an empirical 1-step transition probability matrix calculated from 
the states shown in Figure 12, illustrating one potential application of this technique. The 1-step 
transitions show a particular preference, suggesting some predictability which can be exploited 
by trading agents. To be clear, the online assignment of an FV to a state means that we have 
developed a mechanism to detect which state we are currently in, using the prevailing set of SSVs. 
The transition matrix can be used and updated online, and applied to optimal planning in the domain. 
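
The two steps described above, nearest-SSV state assignment and the empirical 1-step transition matrix, can be sketched as follows. This is an illustration under stated assumptions rather than the study's MATLAB implementation: SSVs are stacked as rows of an array, states are indexed by row position, and the whole sequence of periods is treated as one chain (transitions across the overnight gap are not treated specially). The function names are hypothetical.

```python
import numpy as np


def assign_states(fvs, ssvs):
    """Assign each feature vector to the SSV at minimum Euclidean distance.

    fvs:  array of shape (n_periods, n_features).
    ssvs: array of shape (n_states, n_features).
    Returns the state index per period and the best-match distance per period.
    """
    fvs = np.asarray(fvs, dtype=float)
    ssvs = np.asarray(ssvs, dtype=float)
    # Pairwise distances between every FV and every SSV via broadcasting.
    distances = np.linalg.norm(fvs[:, None, :] - ssvs[None, :, :], axis=2)
    return distances.argmin(axis=1), distances.min(axis=1)


def transition_matrix(states, n_states):
    """Empirical 1-step transition probabilities from a sequence of state indices."""
    counts = np.zeros((n_states, n_states))
    for current, following in zip(states[:-1], states[1:]):
        counts[current, following] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows for states that are never left remain all zeros.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
```

The best-match distances returned by `assign_states` are the quantity whose in-sample and out-of-sample distributions can be compared to judge how long a set of SSVs remains usable.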

Figures 13 to 15 and Tables 6 to 8 show the estimated cluster configurations and transition 
probability matrices using the SSVs at the specified time scale. It is interesting to observe the 
dilution of the exogenous time-of-day effects as one approaches the 5-minute scale. 

Figure 16 illustrates the stability of the online state assignment algorithm out-of-sample. Given 
that the state assignment of an online FV is based on the minimum Euclidean distance to 
predetermined SSVs, we compute the best match distance for each of the FVs in a sample and use a 
boxplot to visualise the empirical distribution. This paper proposes offline estimation of SSVs used 
for online state detection. The online cluster configurations shown in Figures 12, 13, 14 and 15 
use FVs from the ex-ante period, i.e. the same period used to estimate the SSVs. It is prudent to 
determine whether state assignment using out-of-sample (ex-post) FVs deviates significantly from 
in-sample assignment, and gauge the out-of-sample efficacy of the SSVs before re-estimation is 
necessary. Given the computation times shown in Section 5, in practice one could estimate the SSVs 
overnight for each trading day. We have considered SSVs estimated from the period 1 November 
2012 to 30 November 2012, and compared the resulting online states from the ex-ante period (1 
November 2012 to 30 November 2012) with states from an ex-post period (3 December 2012 to 
7 December 2012, one week after SSV estimation). From these results, it appears that 60-minute 
states cannot be reliably determined ex-post using the online detection algorithm, given the 
observed higher range of best match Euclidean distances. The 30-minute, 15-minute and 5-minute 
time scales all exhibit acceptable ex-post best match distances, with the exception of a few outliers. 
From these preliminary results, it appears that the algorithm can be used to reliably determine 
30-minute, 15-minute and 5-minute states for a relatively short ex-post period following SSV 
estimation. A more robust study should consider the precise half-life of the SSVs, but given the 
relatively fast computation time, this is unlikely to be a practical concern. 

Figure 3.: JSE TOP40 60-minute temporal clusters for the period 01-Nov-2012 to 30-Nov-2012, representing 184 distinct periods. 
Each node represents a 60-minute period during a trading day, with the colour shading indicating the time-of-day (Morning = green, 
Lunch = yellow, Afternoon = red) and node connectedness indicating cluster membership. 

Figure 4.: JSE TOP40 60-minute cluster state signature vectors for the period 01-Nov-2012 to 30-Nov-2012. Each plot illustrates the 
average change in trade price, spread, trade volume and quote volume imbalance across member periods and stocks for each of the 
clusters with a size > $x_{\min}$ from the truncated power-law fit. Cluster size and intra-cluster correlation are shown in parentheses. 

Figure 5.: JSE TOP40 30-minute temporal clusters for the period 01-Nov-2012 to 30-Nov-2012, representing 368 distinct periods. 
Each node represents a 30-minute period during a trading day, with the colour shading indicating the time-of-day (Morning = green, 
Lunch = yellow, Afternoon = red) and node connectedness indicating cluster membership. 

Figure 6.: JSE TOP40 30-minute cluster state signature vectors for the period 01-Nov-2012 to 30-Nov-2012. Each plot illustrates the 
average change in trade price, spread, trade volume and quote volume imbalance across member periods and stocks for each of the 
clusters with a size > $x_{\min}$ from the truncated power-law fit. Cluster size and intra-cluster correlation are shown in parentheses. 

Figure 7.: JSE TOP40 15-minute temporal clusters for the period 01-Nov-2012 to 30-Nov-2012, representing 736 distinct periods. 
Each node represents a 15-minute period during a trading day, with the colour shading indicating the time-of-day (Morning = green, 
Lunch = yellow, Afternoon = red) and node connectedness indicating cluster membership. 

Figure 8.: JSE TOP40 15-minute cluster state signature vectors for the period 01-Nov-2012 to 30-Nov-2012. Each plot illustrates the 
average change in trade price, spread, trade volume and quote volume imbalance across member periods and stocks for each of the 
clusters with a size > $x_{\min}$ from the truncated power-law fit. Cluster size and intra-cluster correlation are shown in parentheses. 

Figure 9.: JSE TOP40 5-minute temporal clusters for the period 01-Nov-2012 to 30-Nov-2012, representing 2208 distinct periods. 
Each node represents a 5-minute period during a trading day, with the colour shading indicating the time-of-day (Morning = green, 
Lunch = yellow, Afternoon = red) and node connectedness indicating cluster membership. 

Figure 10.: JSE TOP40 5-minute cluster state signature vectors for the period 01-Nov-2012 to 30-Nov-2012. Each plot illustrates the 
average change in trade price, spread, trade volume and quote volume imbalance across member periods and stocks for each of the 
clusters with a size > $x_{\min}$ from the truncated power-law fit. Cluster size and intra-cluster correlation are shown in parentheses. 

Figure 11.: Testing conjecture of power law fit for varying time scale cluster sizes, applying the Clauset, Shalizi and Newman algorithm 
(Clauset et al. (2009)). $\alpha$ indicates the scaling parameter of the proposed fit, $p_{\text{value}}$ indicates the p-value from a Kolmogorov-Smirnov 
test for the goodness-of-fit of the proposed model to the data, $x_{\min}$ indicates the lower-bound for the power law fit and $\mathcal{L}$ is the 
log-likelihood of the data ($x > x_{\min}$) under the power law fit. 

Figure 12.: Estimated 60-minute clusters using identified state signature vectors. The Euclidean distance is calculated between each 
temporal period’s feature vector and the state signature vectors. Cluster index assignment is based on the state signature vector which 
yields the minimum distance. 

Table 5.: Empirical 1-step transition probability matrix for 60-minute states, based on identified temporal cluster configuration. State 
transitions with a probability > 0 are highlighted in green. 

Figure 13.: Estimated 30-minute clusters using identified state signature vectors. The Euclidean distance is calculated between each 
temporal period’s feature vector and the state signature vectors. Cluster index assignment is based on the state signature vector which 
yields the minimum distance. 

Table 6.: Empirical 1-step transition probability matrix for 30-minute states, based on identified temporal cluster configuration. State 
transitions with a probability > 0 are highlighted in green. 

Figure 14.: Estimated 15-minute clusters using identified state signature vectors. The Euclidean distance is calculated between each 
temporal period’s feature vector and the state signature vectors. Cluster index assignment is based on the state signature vector which 
yields the minimum distance. 

Table 7.: Empirical 1-step transition probability matrix for 15-minute states, based on identified temporal cluster configuration. State 
transitions with a probability > 0 are highlighted in green. 

Figure 15.: Estimated 5-minute clusters using identified state signature vectors. The Euclidean distance is calculated between each 
temporal period’s feature vector and the state signature vectors. Cluster index assignment is based on the state signature vector which 
yields the minimum distance. 

Table 8.: Empirical 1-step transition probability matrix for 5-minute states, based on identified temporal cluster configuration. State 
transitions with a probability > 0 are highlighted in green. 

[Figure 16 panels: boxplots of the Euclidean distance of best-match state assignments, ex-ante vs ex-post, for (a) 60-minute states, 
(b) 30-minute states, (c) 15-minute states and (d) 5-minute states.] 

Figure 16.: Measuring the stability of the online state assignment algorithm out-of-sample. Given that the state assignment of an 
online FV is based on the minimum Euclidean distance to predetermined SSVs, we compute the best match distance for each of the 
FVs in a sample and use a boxplot to visualise the empirical distribution. In this figure, we compare the ex-ante (01-Nov-2012 to 
30-Nov-2012, same period used for SSV estimation) and ex-post (03-Dec-2012 to 07-Dec-2012, one week after SSV estimation window) 
periods. 

9. Conclusion 

In this paper, we have outlined a novel approach for the unsupervised detection of intraday temporal 
market states at varying time scales, as well as a proposed mechanism for significant state selection 
and online state estimation. Using the maximum likelihood approach of Giada and Marsili (2001), 
we show that the technique can be used to cluster temporal periods as objects based on market 
microstructure feature performance. A high-speed PGA was used for cluster detection, with a 
computation time conducive to overnight or even intraday calibration of market states. A study 
of temporal cluster configurations and power-law fits to 60-minute, 30-minute, 15-minute and 5- 
minute time scales revealed scale-specific system behaviour, motivating the need for scale-specific 
state space reduction for optimal planning of participating trading agents. The proposed scheme 
for online state detection suggested the use of SSVs to capture the market activity signature of 
each identified state, with a simple distance metric of the prevailing FV to determine the state 
index. We showed that the online state detection scheme can be used to enumerate and update 1- 
step transition probability matrices, which can be used for optimal planning in the high-frequency 
trading domain. We considered the stability of the algorithm ex-post and found that we could 
reliably determine 30-minute, 15-minute and 5-minute states using the proposed algorithm, whereas 
60-minute states were less stable. 

While this paper demonstrates a feasible framework for temporal state detection, further research 
should consider a longer-term study to determine the stability of identified states and explore 
alternative propositions for features, state signature extraction and online detection. In the South 
African equity market, the impact of significant infrastructure changes (e.g. exchange server 
migration, fee model modifications, co-located trading servers) on temporal system behaviour can be 
considered. 

Appendix A: The Noh-Giada-Marsili coupling parameters 

According to Noh (2000), the generative model of the price associated with the $i$-th stock can be 
written as 

$$x_i(t) = g_{s_i}\,\eta_{s_i}(t) + \sqrt{1 - g_{s_i}^2}\,\epsilon_i(t), \tag{A1}$$

where the cluster-related influences are driven by $\eta_{s_i}$ and the stock-specific influences by $\epsilon_i$. Both 
innovations are treated as Gaussian random variables with unit variance and zero mean¹. The 
relative contribution is controlled by the intra-cluster coupling parameter $g_{s_i}$. The Noh-Giada- 
Marsili model encodes the idea that stocks which have something in common belong in the same 
cluster. This comes with the caveat that stock membership in clusters is mutually exclusive and 
intra-cluster correlations are positive. 

From Equation A1 we compute the covariance for the $i$-th and $j$-th stocks 

$$E[x_i(t)\,x_j(t)] = g_{s_i}^2\,E[\eta_{s_i}\eta_{s_j}] + (1 - g_{s_i}^2)\,E[\epsilon_i\epsilon_j]. \tag{A4}$$

Using the assumption of unit variance and zero mean for both the shared component ($\eta_{s_i}$) and 
stock component ($\epsilon_i$) processes, the correlation between stock $i$ and $j$ is given by 

$$C_{ij} = g_{s_i}^2\,\delta_{s_i,s_j} + (1 - g_{s_i}^2)\,\delta_{ij}. \tag{A5}$$

The following cluster relations can be derived, where $n_s$ is the number of stocks in cluster $s$ and 
$c_s$ is the internal correlation of the cluster, given that clusters are mutually exclusive 

$$n_s = \sum_{i=1}^{N} \delta_{s_i,s}; \qquad c_s = \sum_{i=1}^{N}\sum_{j=1}^{N} C_{ij}\,\delta_{s_i,s}\,\delta_{s_j,s}. \tag{A6}$$

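From Equation A5, for $s_i = s_j = s$, we have $C_{ij} \approx g_s^2$ (Giada and Marsili (2001)). We can multiply 
both sides of Equation A5 by $\delta_{s_i,s}\delta_{s_j,s}$ and sum over all $i$ and $j$ to find 

$$\sum_{i,j} C_{ij}\,\delta_{s_i,s}\,\delta_{s_j,s} = \sum_{i,j} g_{s_i}^2\,\delta_{s_i,s_j}\,\delta_{s_i,s}\,\delta_{s_j,s} + \sum_{i,j} (1 - g_{s_i}^2)\,\delta_{ij}\,\delta_{s_i,s}\,\delta_{s_j,s}. \tag{A7}$$

To sum out the delta functions over the clusters and stocks from 

$$\sum_{i,j} C_{ij}\,\delta_{s_i,s}\,\delta_{s_j,s} = \sum_{i}\Big(g_{s_i}^2\,\delta_{s_i,s}\sum_{j}\delta_{s_i,s_j}\,\delta_{s_j,s}\Big) + \sum_{i}\Big((1 - g_{s_i}^2)\,\delta_{s_i,s}\sum_{j}\delta_{ij}\,\delta_{s_j,s}\Big), \tag{A8}$$

we use that $\sum_j \delta_{ij}\,\delta_{s_j,s} = \delta_{s_i,s}$, $\sum_j \delta_{s_i,s_j}\,\delta_{s_j,s} = n_s\,\delta_{s_i,s}$ and $\delta_{s_i,s}^2 = \delta_{s_i,s}$ to find 

$$\sum_{i,j} C_{ij}\,\delta_{s_i,s}\,\delta_{s_j,s} = g_s^2\,n_s \sum_{i}\delta_{s_i,s} + (1 - g_s^2)\sum_{i}\delta_{s_i,s}. \tag{A9}$$

By combining Equations A6 and A9, we get 

$$c_s = g_s^2\,n_s^2 + (1 - g_s^2)\,n_s = g_s^2\,(n_s^2 - n_s) + n_s. \tag{A10}$$

This can be rearranged to finally obtain an expression for the intra-cluster coupling parameter 
for cluster $s$, 

$$g_s = \sqrt{\frac{c_s - n_s}{n_s^2 - n_s}}. \tag{A11}$$

¹ This form of the price model ensures that the self correlation of a stock is one and independent of the cluster coupling. This 
can be seen by computing the self correlation $E[x_i^2]$ and using that the cluster and stock-unique processes are unit variance, zero 
mean processes 

$$E\Big[\big(g_{s_i}\eta_{s_i} + \sqrt{1 - g_{s_i}^2}\,\epsilon_i\big)^2\Big] = g_{s_i}^2 + (1 - g_{s_i}^2) = 1. \tag{A2}$$

This is not a unique choice; another possible choice often used is 

$$E\left[\left(\frac{\sqrt{g_{s_i}}}{\sqrt{1 + g_{s_i}}}\,\eta_{s_i} + \frac{1}{\sqrt{1 + g_{s_i}}}\,\epsilon_i\right)^2\right] = \frac{1 + g_{s_i}}{1 + g_{s_i}} = 1. \tag{A3}$$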
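
As a worked illustration of Equations A6 and A11, the cluster size $n_s$, internal correlation $c_s$ and coupling parameter $g_s$ can be computed from a correlation matrix and a cluster assignment. This is a minimal sketch: the function name `cluster_statistics` is hypothetical, and two conventions are assumptions of the sketch rather than part of the derivation, namely setting $g_s = 0$ for singleton clusters (where A11 is 0/0) and clipping a negative radicand to zero.

```python
import numpy as np


def cluster_statistics(C, labels):
    """Compute (n_s, c_s, g_s) for each cluster of a correlation matrix C.

    C:      (N, N) correlation matrix of the objects being clustered.
    labels: cluster index s_i of each object, length N.
    """
    C = np.asarray(C, dtype=float)
    labels = np.asarray(labels)
    stats = {}
    for s in np.unique(labels):
        idx = np.flatnonzero(labels == s)
        n_s = idx.size                      # n_s = sum_i delta(s_i, s)
        c_s = C[np.ix_(idx, idx)].sum()     # c_s = sum_ij C_ij delta delta
        if n_s > 1:
            g_squared = (c_s - n_s) / (n_s**2 - n_s)
            g_s = np.sqrt(max(g_squared, 0.0))  # assumption: clip negatives to 0
        else:
            g_s = 0.0                       # assumption: singletons are uncoupled
        stats[int(s)] = (n_s, c_s, g_s)
    return stats
```

For a three-object cluster with all pairwise correlations equal to 0.64, this gives $c_s = 6.84$ and $g_s = 0.8$, consistent with $C_{ij} = g_s^2$ for $i \neq j$ in Equation A5.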


Appendix B: The Noh-Giada-Marsili likelihood function 

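We evaluate the probability of the data satisfying the model by using the multiplicative property 
of probabilities, 

$$P(x_1(1), \ldots, x_N(D)) = \prod_{d=1}^{D}\prod_{i=1}^{N} P(x_i(d)). \tag{B1}$$

The probability of being in a given state that satisfies the model is given as a delta function, such 
that we take the product over all $N$ stocks and all $D$ features (date-times), taking expectations $\langle \ldots \rangle_{\eta,\epsilon}$ over the 
random processes associated with the stock-specific noise and the cluster-specific noise 

$$P = \prod_{d=1}^{D}\left\langle \prod_{i=1}^{N} \delta\Big(x_i(d) - g_{s_i}\eta_{s_i}(d) - \sqrt{1 - g_{s_i}^2}\,\epsilon_i(d)\Big) \right\rangle_{\eta,\epsilon}. \tag{B2}$$

This takes on the form 

\begin{align}
P = {} & \prod_{d=1}^{D}\prod_{i=1}^{N} \int_{-\infty}^{\infty} d\eta_{s_i}\, d\epsilon_i\, \exp\left[-\frac{1}{2}\eta_{s_i}^2 - \frac{1}{2}\epsilon_i^2\right] \tag{B3} \\
& \times\, \delta\Big(x_i(d) - g_{s_i}\eta_{s_i} - \sqrt{1 - g_{s_i}^2}\,\epsilon_i\Big). \tag{B4}
\end{align}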


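This is simplified to the following form, where the product over $i$ stocks is converted to products over the 
clusters $s$ and the $n_s$ stocks in each cluster 

$$P = \prod_{s=1}^{S}\prod_{d=1}^{D} \int_{-\infty}^{\infty} d\eta_s\, e^{-\frac{1}{2}\eta_s^2} \prod_{i \in s} \int_{-\infty}^{\infty} d\epsilon_i\, e^{-\frac{1}{2}\epsilon_i^2}\, \delta\Big(x_i(d) - g_s\eta_s - \sqrt{1 - g_s^2}\,\epsilon_i\Big). \tag{B5}$$

The Gaussian integral over the delta function is evaluated relative to the $\epsilon_i$'s, using that 
$\prod \int f(x)\,\delta(ax - x_0)\,dx = \prod \frac{1}{|a|} f(x_0/a)$ over the $n_s$ delta functions. 

$$P = \prod_{s=1}^{S}\prod_{d=1}^{D} \int_{-\infty}^{\infty} d\eta_s\, e^{-\frac{1}{2}\eta_s^2} \prod_{i \in s} \frac{1}{\sqrt{1 - g_s^2}} \exp\left[-\frac{1}{2}\frac{(g_s\eta_s - x_i)^2}{1 - g_s^2}\right]. \tag{B6}$$

Expanding out the integrand and using that the product of exponentials is the exponential of the sum, 

$$P = \prod_{s=1}^{S}\prod_{d=1}^{D} \int_{-\infty}^{\infty} d\eta_s\, (1 - g_s^2)^{-\frac{n_s}{2}} \exp\left[-\frac{1}{2}\eta_s^2 - \frac{1}{2}\sum_{i \in s} \frac{g_s^2\eta_s^2 - 2 g_s\eta_s x_i + x_i^2}{1 - g_s^2}\right]. \tag{B7}$$

Expanding out the sum terms and evaluating where possible 

$$P = \prod_{s=1}^{S}\prod_{d=1}^{D} \int_{-\infty}^{\infty} d\eta_s\, (1 - g_s^2)^{-\frac{n_s}{2}} \exp\left[-\frac{1}{2}\eta_s^2 - \frac{1}{2}\frac{n_s g_s^2 \eta_s^2}{1 - g_s^2} + \frac{g_s\eta_s}{1 - g_s^2}\sum_{i \in s} x_i - \frac{1}{2}\frac{1}{1 - g_s^2}\sum_{i \in s} x_i^2\right]. \tag{B8}$$

This can be further simplified to 

$$P = \prod_{s=1}^{S}\prod_{d=1}^{D} \int_{-\infty}^{\infty} d\eta_s\, (1 - g_s^2)^{-\frac{n_s}{2}} \exp\left[-\frac{1}{2}\frac{(1 - g_s^2) + n_s g_s^2}{1 - g_s^2}\,\eta_s^2 + \frac{g_s}{1 - g_s^2}\Big(\sum_{i \in s} x_i\Big)\eta_s - \frac{1}{2}\frac{1}{1 - g_s^2}\sum_{i \in s} x_i^2\right]. \tag{B9}$$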


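We now evaluate the Gaussian integral using that $\int_{-\infty}^{\infty} e^{-ax^2}\,dx = \sqrt{\pi/a}$ and hence that 

$$\int_{-\infty}^{\infty} e^{-ax^2 + bx}\,dx = \sqrt{\frac{\pi}{a}}\; e^{\frac{b^2}{4a}}, \tag{B10}$$

with $a = \frac{n_s g_s^2 + (1 - g_s^2)}{2(1 - g_s^2)}$ and $b = \frac{g_s}{1 - g_s^2}\sum_{i \in s} x_i$ read off from Equation B9: 

\begin{align}
P = {} & \prod_{s=1}^{S}\prod_{d=1}^{D} \frac{(2\pi)^{\frac{1}{2}}\,(1 - g_s^2)^{\frac{1}{2}}}{(1 - g_s^2)^{\frac{n_s}{2}}\,\big(n_s g_s^2 + (1 - g_s^2)\big)^{\frac{1}{2}}} \times \exp\left[\frac{g_s^2}{2\big(n_s g_s^2 + (1 - g_s^2)\big)(1 - g_s^2)}\Big(\sum_{i \in s} x_i\Big)^2\right] \tag{B11} \\
& \times \exp\left[-\frac{1}{2}\frac{1}{1 - g_s^2}\sum_{i \in s} x_i^2\right]. \tag{B12}
\end{align}

Evaluating the product of all $D$ times, where $D \gg 1$, 

\begin{align}
P = {} & \prod_{s=1}^{S} \left[\frac{(2\pi)^{\frac{1}{2}}\,(1 - g_s^2)^{\frac{1}{2}}}{(1 - g_s^2)^{\frac{n_s}{2}}\,\big(n_s g_s^2 + (1 - g_s^2)\big)^{\frac{1}{2}}}\right]^{D} \tag{B13} \\
& \times \exp\left[\frac{g_s^2}{2\big(n_s g_s^2 + (1 - g_s^2)\big)(1 - g_s^2)}\sum_{d=1}^{D}\Big(\sum_{i \in s} x_i(d)\Big)^2\right] \tag{B14} \\
& \times \exp\left[-\frac{1}{2}\frac{1}{1 - g_s^2}\sum_{d=1}^{D}\sum_{i \in s} x_i(d)^2\right]. \tag{B15}
\end{align}

Using that $C_{ij} = \frac{1}{D}\sum_{d} x_i(d)\,x_j(d)$, 

$$\sum_{d=1}^{D}\Big(\sum_{i \in s} x_i(d)\Big)^2 = \sum_{i,j=1}^{N}\Big(\sum_{d=1}^{D} x_i(d)\,x_j(d)\Big)\delta_{s_i,s}\,\delta_{s_j,s} = D\,c_s \tag{B16}$$

and that the variance of the process in the cluster can be computed from the trace² 

$$\sum_{i \in s}\sum_{d=1}^{D} x_i(d)^2 = D\sum_{i \in s} C_{ii} = D\,n_s. \tag{B18}$$

Substituting Equation B16 and B18 into Equation B15, 

$$P = \prod_{s=1}^{S} \left[\frac{(2\pi)^{\frac{1}{2}}\,(1 - g_s^2)^{\frac{1}{2}}}{(1 - g_s^2)^{\frac{n_s}{2}}\,\big(n_s g_s^2 + (1 - g_s^2)\big)^{\frac{1}{2}}}\right]^{D} \exp\left[-\frac{D}{2}\frac{n_s}{1 - g_s^2} + \frac{D}{2}\frac{c_s\,g_s^2}{(1 - g_s^2)\big(n_s g_s^2 + (1 - g_s^2)\big)}\right]. \tag{B19}$$

We can rewrite this as 

$$P = \prod_{s=1}^{S} (2\pi)^{\frac{D}{2}}\,(1 - g_s^2)^{-\frac{D}{2}(n_s - 1)}\,\big(n_s g_s^2 + (1 - g_s^2)\big)^{-\frac{D}{2}} \exp\left[-\frac{D}{2}\frac{1}{1 - g_s^2}\left(n_s - \frac{c_s\,g_s^2}{n_s g_s^2 + (1 - g_s^2)}\right)\right]. \tag{B20}$$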


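Then using that $P \propto e^{H_c}$ we can find $H_c \propto \ln(P)$ from Equation B20, and using that $\ln \prod_i A_i = 
\sum_i \ln(A_i)$ to find the log-likelihood function [Need to use $D \gg 1$ and look at expansion ($g_s$ ...)] 

\begin{align}
\ln(P) = {} & -\frac{D}{2}\sum_{s=1}^{S}\Big[\ln\big(n_s g_s^2 + (1 - g_s^2)\big) \tag{B21} \\
& \qquad\qquad + (n_s - 1)\ln(1 - g_s^2)\Big] \tag{B22} \\
& + \frac{D}{2}\sum_{s=1}^{S}\big[\ln(2\pi)\big] \tag{B23} \\
& - \frac{D}{2}\sum_{s=1}^{S}\frac{1}{1 - g_s^2}\left[n_s - \frac{c_s\,g_s^2}{n_s g_s^2 + (1 - g_s^2)}\right]. \tag{B24}
\end{align}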


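Using Equation A11, we can substitute for $g_s$ to find the log-likelihood entirely in terms of 
$n_s$ and $c_s$, using that $(1 - g_s^2) = \frac{n_s^2 - c_s}{n_s^2 - n_s}$ and $\frac{c_s}{n_s} = n_s g_s^2 + (1 - g_s^2)$: 

$$\ln(P) = -\frac{D}{2}\sum_{s:\,n_s>0}\left[\ln\frac{c_s}{n_s} + (n_s - 1)\ln\frac{n_s^2 - c_s}{n_s^2 - n_s}\right] + \frac{D}{2}\sum_{s:\,n_s>0}\big[\ln(2\pi) - n_s\big]. \tag{B25}$$

The last term is a constant, given that $\sum_{s:\,n_s>0} n_s = N$ where $N$ is the number of objects. This is 
fixed for a given system. Hence the likelihood function required is 

$$\mathcal{L}_c = -\frac{D}{2}\sum_{s:\,n_s>0}\left[\ln\frac{c_s}{n_s} + (n_s - 1)\ln\frac{n_s^2 - c_s}{n_s^2 - n_s}\right] \tag{B26}$$

up to a constant $\frac{D}{2}\big(S\ln(2\pi) - N\big)$. 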
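
The final expression (Equation B26) depends only on $n_s$ and $c_s$, so it can be evaluated directly from a correlation matrix and a candidate cluster configuration. The sketch below is illustrative and is not the PGA implementation: the function name `log_likelihood` is hypothetical, the factor `D` defaults to 1 as an assumption (it only rescales the objective), clusters with $c_s \le n_s$ are assumed to contribute zero (no positive coupling), and a perfectly correlated cluster ($c_s = n_s^2$) is reported as infinite likelihood.

```python
import numpy as np


def log_likelihood(C, labels, D=1):
    """Evaluate the cluster log-likelihood for correlation matrix C and labels.

    Each cluster contributes
        ln(c_s / n_s) + (n_s - 1) * ln((n_s^2 - c_s) / (n_s^2 - n_s)),
    and the total is multiplied by -D/2.
    """
    C = np.asarray(C, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for s in np.unique(labels):
        idx = np.flatnonzero(labels == s)
        n_s = idx.size
        c_s = C[np.ix_(idx, idx)].sum()
        if n_s < 2 or c_s <= n_s:
            continue                 # assumption: no contribution without coupling
        if c_s >= n_s**2:
            return float("inf")      # assumption: perfectly correlated cluster
        total += np.log(c_s / n_s) + (n_s - 1) * np.log((n_s**2 - c_s) / (n_s**2 - n_s))
    return -0.5 * D * total
```

For two objects with correlation 0.5, placing them in one cluster gives $\frac{1}{2}\ln\frac{4}{3} \approx 0.144$, while keeping them as singletons gives 0, so the merged configuration is preferred.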


² The trace of the correlation matrix for each cluster $s$ can be verified from the eigenvalues 

$$\sum_{i \in s} C_{ii} = \sum \lambda_s = (n_s - 1)(1 - g_s^2) + n_s g_s^2 + (1 - g_s^2) = n_s. \tag{B17}$$